Since its launch just over 10 years ago, Airbnb has disrupted the travel and hospitality industry. The platform is also a lucrative income-generating tool for home owners. A report published by The Straits Times in December 2016 indicates that the average Singapore Airbnb host makes about $5,000 a year.
To attract tourists and other guests, an Airbnb listing needs a good review scores rating and a suitable price per night.
This project consists of two parts:
Part 1 addresses two main questions about the factors associated with the review scores rating.
Part 2 addresses two main questions about the factors associated with the price per night.
Our questions are not confirmatory in nature; rather, they are exploratory. We hope that investigating these questions will provide valuable insights to home owners who would like to list their rooms or units on the Airbnb Singapore platform.
Inside Airbnb periodically scrapes listing data from the official Airbnb website. For this project we use the Singapore dataset scraped by Inside Airbnb on 21 March 2020, before the Singapore Circuit Breaker started at the beginning of April 2020. We believe the April dataset could be skewed, or could introduce more noise into the data.
library(tidyverse)
data_source <- "http://data.insideairbnb.com/singapore/sg/singapore/2020-03-21/data/listings.csv.gz"
airbnb_listings <- read_csv(data_source)
glimpse(airbnb_listings)
## Rows: 7,713
## Columns: 106
## $ id <dbl> 49091, 50646, 56334, 716…
## $ listing_url <chr> "https://www.airbnb.com/…
## $ scrape_id <dbl> 2.020032e+13, 2.020032e+…
## $ last_scraped <date> 2020-03-21, 2020-03-21,…
## $ name <chr> "COZICOMFORT LONG TERM S…
## $ summary <chr> NA, "Fully furnished bed…
## $ space <chr> "This is Room No. 2.(ava…
## $ description <chr> "This is Room No. 2.(ava…
## $ experiences_offered <chr> "none", "none", "none", …
## $ neighborhood_overview <chr> NA, "The serenity & quie…
## $ notes <chr> NA, "Accommodation has a…
## $ transit <chr> NA, "Less than 400m from…
## $ access <chr> NA, "Kitchen, washing fa…
## $ interaction <chr> NA, "We love to host peo…
## $ house_rules <chr> "No smoking indoors. Ple…
## $ thumbnail_url <lgl> NA, NA, NA, NA, NA, NA, …
## $ medium_url <lgl> NA, NA, NA, NA, NA, NA, …
## $ picture_url <chr> "https://a0.muscache.com…
## $ xl_picture_url <lgl> NA, NA, NA, NA, NA, NA, …
## $ host_id <dbl> 266763, 227796, 266763, …
## $ host_url <chr> "https://www.airbnb.com/…
## $ host_name <chr> "Francesca", "Sujatha", …
## $ host_since <date> 2010-10-20, 2010-09-08,…
## $ host_location <chr> "singapore", "Singapore,…
## $ host_about <chr> "I am a private tutor by…
## $ host_response_time <chr> "N/A", "within a day", "…
## $ host_response_rate <chr> "N/A", "100%", "N/A", "1…
## $ host_acceptance_rate <chr> "N/A", "N/A", "N/A", "10…
## $ host_is_superhost <lgl> FALSE, FALSE, FALSE, FAL…
## $ host_thumbnail_url <chr> "https://a0.muscache.com…
## $ host_picture_url <chr> "https://a0.muscache.com…
## $ host_neighbourhood <chr> "Woodlands", "Bukit Tima…
## $ host_listings_count <dbl> 2, 1, 2, 8, 8, 8, 8, 0, …
## $ host_total_listings_count <dbl> 2, 1, 2, 8, 8, 8, 8, 0, …
## $ host_verifications <chr> "['email', 'phone', 'fac…
## $ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ host_identity_verified <lgl> FALSE, FALSE, FALSE, TRU…
## $ street <chr> "Singapore, Singapore", …
## $ neighbourhood <chr> "Woodlands", "Bukit Tima…
## $ neighbourhood_cleansed <chr> "Woodlands", "Bukit Tima…
## $ neighbourhood_group_cleansed <chr> "North Region", "Central…
## $ city <chr> "Singapore", "Singapore"…
## $ state <chr> NA, NA, NA, NA, NA, NA, …
## $ zipcode <chr> "730702", "589664", NA, …
## $ market <chr> "Singapore", "Singapore"…
## $ smart_location <chr> "Singapore", "Singapore"…
## $ country_code <chr> "SG", "SG", "SG", "SG", …
## $ country <chr> "Singapore", "Singapore"…
## $ latitude <dbl> 1.44255, 1.33235, 1.4424…
## $ longitude <dbl> 103.7958, 103.7852, 103.…
## $ is_location_exact <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ property_type <chr> "Apartment", "Apartment"…
## $ room_type <chr> "Private room", "Private…
## $ accommodates <dbl> 1, 2, 1, 6, 3, 3, 6, 2, …
## $ bathrooms <dbl> 1.0, 1.0, 1.0, 1.0, 0.5,…
## $ bedrooms <dbl> 1, 1, 1, 2, 1, 1, 1, 1, …
## $ beds <dbl> 1, 1, 1, 3, 1, 2, 7, NA,…
## $ bed_type <chr> "Real Bed", "Real Bed", …
## $ amenities <chr> "{TV,\"Cable TV\",Intern…
## $ square_feet <dbl> 0, NA, 0, 205, NA, NA, 4…
## $ price <chr> "$87.00", "$80.00", "$72…
## $ weekly_price <chr> NA, "$400.00", NA, NA, "…
## $ monthly_price <chr> "$1,087.00", "$1,600.00"…
## $ security_deposit <chr> NA, NA, NA, "$290.00", "…
## $ cleaning_fee <chr> NA, NA, NA, "$58.00", "$…
## $ guests_included <dbl> 1, 2, 1, 4, 1, 1, 4, 1, …
## $ extra_people <chr> "$14.00", "$20.00", "$14…
## $ minimum_nights <dbl> 180, 90, 6, 1, 1, 1, 1, …
## $ maximum_nights <dbl> 360, 730, 14, 1125, 1125…
## $ minimum_minimum_nights <dbl> 180, 90, 6, 1, 1, 1, 1, …
## $ maximum_minimum_nights <dbl> 180, 90, 6, 1, 1, 1, 1, …
## $ minimum_maximum_nights <dbl> 360, 730, 14, 1125, 1125…
## $ maximum_maximum_nights <dbl> 360, 730, 14, 1125, 1125…
## $ minimum_nights_avg_ntm <dbl> 180, 90, 6, 1, 1, 1, 1, …
## $ maximum_nights_avg_ntm <dbl> 360, 730, 14, 1125, 1125…
## $ calendar_updated <chr> "70 months ago", "68 mon…
## $ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ availability_30 <dbl> 30, 30, 30, 30, 30, 28, …
## $ availability_60 <dbl> 60, 60, 60, 60, 60, 51, …
## $ availability_90 <dbl> 90, 90, 90, 90, 90, 81, …
## $ availability_365 <dbl> 365, 365, 365, 365, 365,…
## $ calendar_last_scraped <date> 2020-03-21, 2020-03-21,…
## $ number_of_reviews <dbl> 1, 18, 20, 20, 24, 48, 2…
## $ number_of_reviews_ltm <dbl> 0, 0, 0, 8, 4, 17, 6, 0,…
## $ first_review <date> 2013-10-21, 2014-04-18,…
## $ last_review <date> 2013-10-21, 2014-12-26,…
## $ review_scores_rating <dbl> 94, 91, 98, 89, 83, 88, …
## $ review_scores_accuracy <dbl> 10, 9, 10, 9, 8, 9, 9, N…
## $ review_scores_cleanliness <dbl> 10, 10, 10, 8, 8, 9, 8, …
## $ review_scores_checkin <dbl> 10, 10, 10, 9, 9, 9, 9, …
## $ review_scores_communication <dbl> 10, 10, 10, 10, 9, 9, 9,…
## $ review_scores_location <dbl> 8, 9, 8, 9, 8, 9, 9, NA,…
## $ review_scores_value <dbl> 8, 9, 9, 9, 8, 9, 8, NA,…
## $ requires_license <lgl> FALSE, FALSE, FALSE, FAL…
## $ license <lgl> NA, NA, NA, NA, NA, NA, …
## $ jurisdiction_names <lgl> NA, NA, NA, NA, NA, NA, …
## $ instant_bookable <lgl> FALSE, FALSE, FALSE, TRU…
## $ is_business_travel_ready <lgl> FALSE, FALSE, FALSE, FAL…
## $ cancellation_policy <chr> "flexible", "moderate", …
## $ require_guest_profile_picture <lgl> TRUE, FALSE, TRUE, FALSE…
## $ require_guest_phone_verification <lgl> TRUE, TRUE, TRUE, TRUE, …
## $ calculated_host_listings_count <dbl> 2, 1, 2, 8, 8, 8, 8, 1, …
## $ calculated_host_listings_count_entire_homes <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ calculated_host_listings_count_private_rooms <dbl> 2, 1, 2, 8, 8, 8, 8, 1, …
## $ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reviews_per_month <dbl> 0.01, 0.25, 0.19, 0.20, …
The data consists of 7713 observations with 106 variables.
The variables that will be selected in this project are:
Review Scores Rating aggregates the overall star rating and the set of category star ratings that guests submit on the Airbnb platform for their stay.
The values range from 20 to 100, with 100 the highest and 20 the lowest.
Number of reviews (number_of_reviews) is the total number of reviews submitted by guests who stayed in the room/unit. The more reviews a listing has, the more reliable its Review Scores Rating.
Essential amenities are basic items that a guest expects in order to have a comfortable stay.
These include:
Airbnb grants Superhost status to hosts who consistently provide exceptional, high-quality experiences to guests. The host_is_superhost variable takes the value true or false.
The latitude and longitude of the room/unit as posted on the Airbnb site.
The price of the room/unit for a night stay.
Since review_scores_rating will be used as the response variable, we drop rows where review_scores_rating is missing (NA).
airbnb_data <- airbnb_listings %>%
drop_na(review_scores_rating) %>%
select(c(review_scores_rating, number_of_reviews, price,
host_is_superhost, amenities, latitude, longitude))
glimpse(airbnb_data)
## Rows: 4,763
## Columns: 7
## $ review_scores_rating <dbl> 94, 91, 98, 89, 83, 88, 82, 99, 99, 98, 89, 90, …
## $ number_of_reviews <dbl> 1, 18, 20, 20, 24, 48, 29, 176, 199, 236, 133, 1…
## $ price <chr> "$87.00", "$80.00", "$72.00", "$214.00", "$99.00…
## $ host_is_superhost <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ amenities <chr> "{TV,\"Cable TV\",Internet,Wifi,\"Air conditioni…
## $ latitude <dbl> 1.44255, 1.33235, 1.44246, 1.34541, 1.34567, 1.3…
## $ longitude <dbl> 103.7958, 103.7852, 103.7967, 103.9571, 103.9596…
Next we create six derived variables:
- superhost_status: Indicator of whether the host is a superhost.
- popular_status: Indicator of listings with more than 3 reviews.
- amenity_count: Count of amenities provided by the host.
- mb_distance: The distance (in km) between the room/unit and Marina Bay, Singapore.
- price_per_night: Numerical value of the price.
- normalised_rating: Normalised value of the review scores rating.

where:
- superhost_status = 1 if host_is_superhost is TRUE, 0 otherwise.
- popular_status = 1 if number_of_reviews > 3, 0 otherwise.
- amenity_count = Number of phrases in the amenities text, delimited by ",".
- mb_distance = Computed using the latitude and longitude of the room/unit and of Marina Bay (latitude = 1.2823, longitude = 103.8585).
- price_per_night = Numerical value of the price without the currency symbol.
- normalised_rating = Review scores rating normalised to the range [0, 5].

mbDistance <- function(lat, long) {
degree_to_km <- 111.139;
degree_to_km*((1.2823 - lat)**2 + (103.8585 - long)**2)**0.5
}
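The helper above treats one degree of latitude and one degree of longitude as the same length, which is a reasonable shortcut this close to the equator. As a rough cross-check, the sketch below compares it with a haversine great-circle distance. The function names `planar_km` and `haversine_km` and the mean earth radius of 6371 km are assumptions introduced for illustration; they are not part of the original analysis.

```r
# Planar approximation, mirroring mbDistance() (1 degree taken as 111.139 km)
planar_km <- function(lat, long, lat0 = 1.2823, long0 = 103.8585) {
  111.139 * sqrt((lat0 - lat)^2 + (long0 - long)^2)
}

# Haversine great-circle distance; earth_radius_km = 6371 is an assumed mean radius
haversine_km <- function(lat, long, lat0 = 1.2823, long0 = 103.8585,
                         earth_radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat0 - lat) * to_rad
  dlong <- (long0 - long) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat * to_rad) * cos(lat0 * to_rad) * sin(dlong / 2)^2
  2 * earth_radius_km * asin(sqrt(a))
}

# First listing in the data: latitude 1.44255, longitude 103.7958
planar_km(1.44255, 103.7958)
haversine_km(1.44255, 103.7958)
```

For Singapore-sized distances the two values should differ only by a few metres, so the simpler planar form is unlikely to affect the regression results.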
airbnb_data <- airbnb_data %>%
  mutate(
    # Treat a missing superhost flag as "not a superhost"
    host_is_superhost = replace_na(host_is_superhost, FALSE),
    superhost_status = ifelse(host_is_superhost, 1, 0),
    popular_status = ifelse(number_of_reviews > 3, 1, 0),
    # Count the comma-delimited entries in the amenities string
    amenity_count = lengths(str_split(amenities, "[,]")),
    # mbDistance() is vectorised, so it can be applied to whole columns
    mb_distance = mbDistance(latitude, longitude),
    # parse_number() drops the "$" sign and thousands separators
    price_per_night = parse_number(price),
    normalised_rating = round(review_scores_rating / 100 * 5, digits = 3)
  )
glimpse(airbnb_data)
## Rows: 4,763
## Columns: 13
## $ review_scores_rating <dbl> 94, 91, 98, 89, 83, 88, 82, 99, 99, 98, 89, 90, …
## $ number_of_reviews <dbl> 1, 18, 20, 20, 24, 48, 29, 176, 199, 236, 133, 1…
## $ price <chr> "$87.00", "$80.00", "$72.00", "$214.00", "$99.00…
## $ host_is_superhost <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ amenities <chr> "{TV,\"Cable TV\",Internet,Wifi,\"Air conditioni…
## $ latitude <dbl> 1.44255, 1.33235, 1.44246, 1.34541, 1.34567, 1.3…
## $ longitude <dbl> 103.7958, 103.7852, 103.7967, 103.9571, 103.9596…
## $ superhost_status <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, …
## $ popular_status <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ amenity_count <int> 9, 13, 10, 26, 22, 19, 24, 37, 36, 36, 28, 16, 2…
## $ mb_distance <dbl> 19.124743, 9.863501, 19.080393, 13.012653, 13.26…
## $ price_per_night <dbl> 87, 80, 72, 214, 99, 109, 217, 52, 54, 39, 62, 4…
## $ normalised_rating <dbl> 4.70, 4.55, 4.90, 4.45, 4.15, 4.40, 4.10, 4.95, …
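To make the amenity_count and price_per_night rules concrete, here is a small base-R sketch on made-up strings that mimic the raw column formats. The inputs are toy values, not rows from the dataset, and base functions (`strsplit`, `gsub`) stand in for the stringr and readr calls used in the pipeline.

```r
# Toy inputs mimicking the raw column formats (not rows from the dataset)
amenities_raw <- c('{TV,"Cable TV",Internet,Wifi}', '{Wifi}', '{}')
price_raw <- c("$87.00", "$1,087.00")

# Same rule as the pipeline: count the comma-delimited pieces
amenity_count <- lengths(strsplit(amenities_raw, ",", fixed = TRUE))
amenity_count

# Strip everything except digits and the decimal point, then convert
price_per_night <- as.numeric(gsub("[^0-9.]", "", price_raw))
price_per_night
```

One consequence of counting comma-delimited pieces is that an empty amenity set such as `{}` still counts as 1, so amenity_count is best read as an approximate count.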
To start modeling, we sample 1000 listings from the 4763 listings that have a defined review scores rating.
set.seed(20200505)
airbnb_data_n <- airbnb_data %>%
sample_n(1000)
library(broom)
We visualise whether there are any patterns in the relationship between normalised_rating and amenity_count, and in the relationship between normalised_rating and mb_distance.
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
In both plots, the points cluster at the top, indicating a large number of perfect or near-perfect ratings.
We therefore transform the rating into a bad score:
$$\text{bad\_score} = \left(\max(\text{normalised\_rating}) - \text{normalised\_rating}\right)^{0.125}$$
where a lower bad_score corresponds to a higher normalised_rating, and vice versa.
airbnb_data_n <- airbnb_data_n %>%
mutate(bad_score = (max(normalised_rating) - normalised_rating)**0.125)
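A quick worked example may help in reading the transformation. The sketch below applies it to a few ratings that appear in the sampled data, assuming the sample maximum is 5. The helper name `bad_score_of` is hypothetical and introduced only for this illustration.

```r
# bad_score for a vector of ratings on the 0-5 scale;
# max_rating defaults to the maximum of the supplied ratings, as in the pipeline
bad_score_of <- function(normalised_rating,
                         max_rating = max(normalised_rating)) {
  (max_rating - normalised_rating)^0.125
}

ratings <- c(5.00, 4.80, 4.50, 3.15)
round(bad_score_of(ratings), 4)
```

Because the exponent is 1/8, small gaps below the maximum are stretched considerably (a rating of 4.80 already maps to roughly 0.82), which spreads out the cluster of near-perfect ratings.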
We visualise the same relationships using the bad score (bad_score):
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
To prepare for the model, we computed a correlation matrix to see how our variables of interest (within the model) are related.
library(magrittr)  # provides the %$% exposition pipe, which library(tidyverse) does not attach
airbnb_data_n %$%
cor(tibble(bad_score, amenity_count, mb_distance,
superhost_status, popular_status)) %>%
round(., 2)
## bad_score amenity_count mb_distance superhost_status
## bad_score 1.00 -0.09 -0.06 -0.08
## amenity_count -0.09 1.00 -0.15 0.26
## mb_distance -0.06 -0.15 1.00 0.00
## superhost_status -0.08 0.26 0.00 1.00
## popular_status 0.49 0.05 0.02 0.13
## popular_status
## bad_score 0.49
## amenity_count 0.05
## mb_distance 0.02
## superhost_status 0.13
## popular_status 1.00
airbnb_data_n %>%
select(bad_score, amenity_count, mb_distance,
superhost_status, popular_status) %>%
as.matrix(.) %>%
Hmisc::rcorr(.) %>%
tidy(.) %>% as_tibble()
## # A tibble: 10 x 5
## column1 column2 estimate n p.value
## <chr> <chr> <dbl> <int> <dbl>
## 1 bad_score amenity_count -0.0922 1000 0.00353
## 2 bad_score mb_distance -0.0634 1000 0.0449
## 3 amenity_count mb_distance -0.152 1000 0.00000137
## 4 bad_score superhost_status -0.0842 1000 0.00773
## 5 amenity_count superhost_status 0.257 1000 0
## 6 mb_distance superhost_status 0.000441 1000 0.989
## 7 bad_score popular_status 0.491 1000 0
## 8 amenity_count popular_status 0.0534 1000 0.0914
## 9 mb_distance popular_status 0.0192 1000 0.545
## 10 superhost_status popular_status 0.134 1000 0.0000211
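The p-values in the table can be approximately reproduced from the correlation estimate and the sample size alone. The sketch below assumes the usual t-based test for a Pearson correlation; `cor_p_value` is a hypothetical helper name.

```r
# Two-sided p-value for a Pearson correlation r from n pairs (t distribution)
cor_p_value <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(-abs(t_stat), df = n - 2)
}

# bad_score vs amenity_count, using the rounded estimate from the table
cor_p_value(-0.0922, 1000)

# mb_distance vs superhost_status
cor_p_value(0.000441, 1000)
```

With n = 1000, even a correlation below 0.1 in absolute value can be statistically significant, so the estimates matter more than the p-values when judging the strength of these relationships.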
Then we performed mean-centering transformations on all the variables that will be turned into interaction terms.
airbnb_data2_n <- airbnb_data_n %>%
mutate_at(vars(amenity_count:mb_distance), list(~ . - mean(., na.rm = TRUE)))  # list(~ ...) replaces the deprecated funs()
glimpse(airbnb_data2_n)
## Rows: 1,000
## Columns: 14
## $ review_scores_rating <dbl> 63, 100, 96, 100, 90, 92, 96, 100, 100, 90, 98, …
## $ number_of_reviews <dbl> 6, 1, 146, 1, 20, 19, 19, 1, 5, 2, 24, 1, 33, 1,…
## $ price <chr> "$80.00", "$57.00", "$220.00", "$148.00", "$110.…
## $ host_is_superhost <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, F…
## $ amenities <chr> "{TV,Wifi,\"Air conditioning\",Kitchen,Elevator,…
## $ latitude <dbl> 1.30456, 1.31111, 1.31957, 1.30631, 1.30733, 1.2…
## $ longitude <dbl> 103.8347, 103.8582, 103.8465, 103.8481, 103.8471…
## $ superhost_status <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, …
## $ popular_status <dbl> 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, …
## $ amenity_count <dbl> -14.474, 4.526, 15.526, 11.526, -11.474, -21.474…
## $ mb_distance <dbl> -1.53175155, -1.95222147, -0.80274885, -2.246287…
## $ price_per_night <dbl> 80, 57, 220, 148, 110, 25, 80, 33, 188, 238, 30,…
## $ normalised_rating <dbl> 3.15, 5.00, 4.80, 5.00, 4.50, 4.60, 4.80, 5.00, …
## $ bad_score <dbl> 1.0799321, 0.0000000, 0.8177654, 0.0000000, 0.91…
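The effect of mean-centering on a model with an interaction can be seen on simulated data. The sketch below uses toy variables (`x` standing in for amenity_count, `g` for popular_status); none of the numbers come from the Airbnb sample.

```r
# Toy data (simulated, not the Airbnb sample)
set.seed(1)
x <- runif(200, 2, 55)      # stand-in for amenity_count
g <- rbinom(200, 1, 0.5)    # stand-in for popular_status
y <- 0.4 + 0.45 * g - 0.008 * x + 0.006 * g * x + rnorm(200, sd = 0.3)

x_centered <- x - mean(x)

fit_raw <- lm(y ~ g * x)
fit_centered <- lm(y ~ g * x_centered)

# The interaction coefficient is unchanged by centering; the coefficient
# of g now describes the group gap at the mean of x rather than at x = 0
coef(fit_raw)
coef(fit_centered)
```

Centering therefore changes the interpretation of the main effects, not the fit: both models give the same predictions.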
We ran two regression models. The first (model1) regressed bad score (bad_score) on the listing factors (superhost_status, popular_status), the number of amenities provided by the host (amenity_count) and the distance to Marina Bay (mb_distance). Our key investigation lies in the second (model2), which additionally includes the interactions of popular_status with amenity_count and with mb_distance; superhost_status enters as a main effect only.
model1 <- lm(data = airbnb_data2_n,
bad_score ~
superhost_status + popular_status + mb_distance + amenity_count)
tidy(model1) %>% as_tibble()
## # A tibble: 5 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.392 0.0188 20.8 5.47e-80
## 2 superhost_status -0.127 0.0281 -4.54 6.44e- 6
## 3 popular_status 0.449 0.0237 19.0 1.20e-68
## 4 mb_distance -0.00943 0.00291 -3.25 1.21e- 3
## 5 amenity_count -0.00481 0.00135 -3.56 3.90e- 4
glance(model1)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.279 0.276 0.364 96.1 3.74e-69 5 -406. 825. 854.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
model2 <- lm(data = airbnb_data2_n,
bad_score ~
superhost_status + popular_status * (mb_distance + amenity_count))
tidy(model2) %>% as_tibble()
## # A tibble: 7 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.389 0.0188 20.6 4.60e-79
## 2 superhost_status -0.126 0.0281 -4.49 8.13e- 6
## 3 popular_status 0.451 0.0237 19.0 3.78e-69
## 4 mb_distance -0.0138 0.00484 -2.84 4.59e- 3
## 5 amenity_count -0.00853 0.00203 -4.21 2.83e- 5
## 6 popular_status:mb_distance 0.00619 0.00605 1.02 3.07e- 1
## 7 popular_status:amenity_count 0.00641 0.00264 2.43 1.55e- 2
glance(model2)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.283 0.279 0.363 65.4 1.66e-68 7 -403. 823. 862.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
Next we use the anova function to test whether Model 2, with its interaction terms, improves explanatory power over Model 1.
anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: bad_score ~ superhost_status + popular_status + mb_distance +
## amenity_count
## Model 2: bad_score ~ superhost_status + popular_status * (mb_distance +
## amenity_count)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 995 131.99
## 2 993 131.16 2 0.82477 3.1221 0.0445 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result indicates that adding the interaction terms gives a statistically significant improvement in fit over model 1 (F(2, 993) = 3.12, p = 0.0445), although R-squared rises only from 0.279 to 0.283.
We check the linear model assumptions for model 2 using the ‘gvlma’ package.
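The F statistic in the ANOVA table can be recomputed by hand from the reported sums of squares, which makes clear what the test compares. The sketch below uses only figures printed in the table; the variable names are illustrative.

```r
# Values read from the ANOVA table
sum_sq_gain <- 0.82477   # reduction in RSS from adding the two interaction terms
rss_model2 <- 131.16     # residual sum of squares of model 2
df_gain <- 2             # number of added terms
df_resid <- 993          # residual degrees of freedom of model 2

# F = (RSS reduction per added term) / (residual variance of the larger model)
f_stat <- (sum_sq_gain / df_gain) / (rss_model2 / df_resid)
p_value <- pf(f_stat, df_gain, df_resid, lower.tail = FALSE)
c(F = f_stat, p = p_value)
```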
library(gvlma)
gvlma(model2)
##
## Call:
## lm(formula = bad_score ~ superhost_status + popular_status *
## (mb_distance + amenity_count), data = airbnb_data2_n)
##
## Coefficients:
## (Intercept) superhost_status
## 0.388886 -0.125977
## popular_status mb_distance
## 0.450558 -0.013751
## amenity_count popular_status:mb_distance
## -0.008526 0.006185
## popular_status:amenity_count
## 0.006407
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = model2)
##
## Value p-value Decision
## Global Stat 0.333300 0.9876 Assumptions acceptable.
## Skewness 0.237791 0.6258 Assumptions acceptable.
## Kurtosis 0.057621 0.8103 Assumptions acceptable.
## Link Function 0.030287 0.8618 Assumptions acceptable.
## Heteroscedasticity 0.007601 0.9305 Assumptions acceptable.
Next we check for multicollinearity in model 2 using the ‘car’ package, to ensure there is no problematic collinearity among the predictor variables.
library(car)
vif(model2)
## superhost_status popular_status
## 1.092442 1.019871
## mb_distance amenity_count
## 2.857631 2.482130
## popular_status:mb_distance popular_status:amenity_count
## 2.814296 2.341797
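For readers unfamiliar with the variance inflation factor, it is 1 / (1 - R²) from regressing one predictor on all the others. The base-R sketch below computes it by hand on simulated predictors. `vif_manual` is a hypothetical helper, and it only covers plain predictors; for a model with interaction terms the product columns would have to be included as well.

```r
# VIF of each predictor: 1 / (1 - R^2) from regressing it on the other predictors
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Toy predictors (simulated): x2 is built to correlate with x1, x3 is independent
set.seed(42)
toy <- data.frame(x1 = rnorm(500))
toy$x2 <- 0.8 * toy$x1 + rnorm(500, sd = 0.6)
toy$x3 <- rnorm(500)

round(vif_manual(toy), 2)
```

A value near 1 means a predictor is almost unrelated to the others; common rules of thumb treat values above 5 or 10 as a concern, and every value reported for model 2 is below 3.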
We further investigate the significance of the coefficients of Model 2.
summary(model2)
##
## Call:
## lm(formula = bad_score ~ superhost_status + popular_status *
## (mb_distance + amenity_count), data = airbnb_data2_n)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.89056 -0.27203 0.05127 0.14168 0.90917
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.388886 0.018838 20.644 < 2e-16 ***
## superhost_status -0.125977 0.028087 -4.485 8.13e-06 ***
## popular_status 0.450558 0.023663 19.041 < 2e-16 ***
## mb_distance -0.013751 0.004840 -2.841 0.00459 **
## amenity_count -0.008526 0.002027 -4.206 2.83e-05 ***
## popular_status:mb_distance 0.006185 0.006053 1.022 0.30705
## popular_status:amenity_count 0.006407 0.002641 2.426 0.01546 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3634 on 993 degrees of freedom
## Multiple R-squared: 0.2832, Adjusted R-squared: 0.2788
## F-statistic: 65.37 on 6 and 993 DF, p-value: < 2.2e-16
library(knitr)
kable(tidy(model2))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.3888864 | 0.0188378 | 20.643992 | 0.0000000 |
| superhost_status | -0.1259773 | 0.0280866 | -4.485313 | 0.0000081 |
| popular_status | 0.4505584 | 0.0236626 | 19.040943 | 0.0000000 |
| mb_distance | -0.0137506 | 0.0048405 | -2.840753 | 0.0045927 |
| amenity_count | -0.0085259 | 0.0020270 | -4.206202 | 0.0000283 |
| popular_status:mb_distance | 0.0061855 | 0.0060526 | 1.021956 | 0.3070504 |
| popular_status:amenity_count | 0.0064072 | 0.0026414 | 2.425666 | 0.0154577 |
kable(glance(model2))
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.2831563 | 0.2788249 | 0.3634375 | 65.3732 | 0 | 7 | -403.2782 | 822.5564 | 861.8184 | 131.1622 | 993 |
The regression analysis yielded one significant interaction term (popular_status:amenity_count).
It appears that the relationship between bad score and amenity count differs depending on the listing's popular status.
To see the pattern of the interaction, we visualise the significant interaction effect in the next section.
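The interaction coefficient can be read as the difference between two slopes. As a small worked calculation using the model 2 estimates reported in the tables (variable names here are illustrative):

```r
# Coefficients of model 2 as reported in the tables
b_amenity <- -0.008526       # slope of amenity_count when popular_status = 0
b_interaction <- 0.006407    # popular_status:amenity_count

slope_not_popular <- b_amenity
slope_popular <- b_amenity + b_interaction

c(not_popular = slope_not_popular, popular = slope_popular)
```

Each additional amenity is associated with a drop in bad score of about 0.0085 for less popular listings, but only about 0.0021 for popular listings.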
To visualize the OLS regression analysis performed above, we stored the OLS regression model’s predictions.
library(modelr)
grid <- airbnb_data2_n %>%
data_grid(amenity_count, popular_status, mb_distance = 0, superhost_status = 0) %>%
add_predictions(model2)
We undo the centering of amenity_count so that the plot uses the original scale.
grid <- grid %>%
mutate(amenity_count = amenity_count + mean(airbnb_data_n$amenity_count))
The following figure shows two lines, one for each popular status, illustrating how the relationship between amenity count and bad score differs with listing popularity.
We also plot the two popularity levels separately, with the data points shown as dots, so that the pattern of the relationship can be seen from a different angle.
The result has one very surprising finding:
The relationship between the number of amenities provided by the host and the bad score differs depending on the popularity of the listing. Specifically, popular listings (more than 3 reviews) appear to have a higher bad score (lower review scores rating) than less popular listings (3 or fewer reviews), regardless of the number of amenities provided by the host.
In addition, less popular listings get noticeably better reviews as the number of amenities increases, whereas for popular listings the improvement is much weaker.
Interaction Model without mean-centering
Plot interaction using interact_plot() to account for confidence intervals (CIs)
# Interaction Model
model2 <- lm(data = airbnb_data_n,
bad_score ~
superhost_status + popular_status * (mb_distance + amenity_count))
Run simple slopes analysis & spotlight analysis using Johnson-Neyman Techniques
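library(interactions)  # assumed source of sim_slopes() and interact_plot(); the original load call is not shown
sim_slopes(model2, pred = amenity_count, modx = popular_status, johnson_neyman = FALSE)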
## SIMPLE SLOPES ANALYSIS
##
## Slope of amenity_count when popular_status = 0.00 (0):
##
## Est. S.E. t val. p
## ------- ------ -------- ------
## -0.01 0.00 -4.21 0.00
##
## Slope of amenity_count when popular_status = 1.00 (1):
##
## Est. S.E. t val. p
## ------- ------ -------- ------
## -0.00 0.00 -1.20 0.23
sim_slopes(model2, pred=popular_status, modx=amenity_count, johnson_neyman=T)
## JOHNSON-NEYMAN INTERVAL
##
## When amenity_count is OUTSIDE the interval [-344.53, -15.00], the slope of
## popular_status is p < .05.
##
## Note: The range of observed values of amenity_count is [2.00, 55.00]
##
## SIMPLE SLOPES ANALYSIS
##
## Slope of popular_status when amenity_count = 14.54 (- 1 SD):
##
## Est. S.E. t val. p
## ------ ------ -------- ------
## 0.39 0.03 11.86 0.00
##
## Slope of popular_status when amenity_count = 23.47 (Mean):
##
## Est. S.E. t val. p
## ------ ------ -------- ------
## 0.45 0.02 19.04 0.00
##
## Slope of popular_status when amenity_count = 32.41 (+ 1 SD):
##
## Est. S.E. t val. p
## ------ ------ -------- ------
## 0.51 0.03 15.07 0.00
The Johnson-Neyman interval shows that the slope of popular_status is significant (p < .05) whenever amenity_count lies outside [-344.53, -15.00]. Since the observed amenity counts range from 2 to 55, entirely outside that interval, the slope is significant at every observed value of amenity_count.
Run interact_plot() again by adding benchmark for regions of significance
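The simple slopes above follow directly from the centered model 2 coefficients. The sketch below recomputes them, taking the sample mean of amenity_count to be 23.474; that mean is inferred from the centered values in the glimpse output and from the reported "Mean" row, so treat it as an assumption.

```r
b_popular <- 0.450558       # popular_status coefficient (amenity_count at its mean)
b_interaction <- 0.006407   # popular_status:amenity_count
amenity_mean <- 23.474      # inferred sample mean of amenity_count (assumption)

# Estimated gap in bad score between popular and less popular listings
slope_popular_at <- function(amenity_count) {
  b_popular + b_interaction * (amenity_count - amenity_mean)
}

round(slope_popular_at(c(14.54, 23.47, 32.41)), 2)

# amenity_count at which the estimated gap would be zero
zero_gap_at <- amenity_mean - b_popular / b_interaction
zero_gap_at
```

The gap would only vanish at a negative amenity count, which falls inside the reported Johnson-Neyman interval and far outside the observed range of 2 to 55.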
As the price per night is our response variable, let's investigate its distribution using a histogram.
airbnb_data_n %>%
ggplot() +
geom_histogram(mapping = aes(x = price_per_night), binwidth = 10)
Since the price is right-skewed, we log-transform price_per_night and create one more variable, log_price, which serves as the response variable in Part 2:
airbnb_data_n <- airbnb_data_n %>%
mutate(log_price = log(price_per_night))
airbnb_data_n %>%
ggplot() +
geom_histogram(mapping = aes(x = log_price), binwidth = 1)
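As a small check on the transformation, the sketch below applies the natural logarithm to the first few prices visible in the sampled data and maps them back with `exp()`. The note about zero prices is a general caveat, not something observed in this dataset.

```r
price_per_night <- c(80, 57, 220, 148, 110)  # first prices in the sampled data
log_price <- log(price_per_night)            # natural logarithm
round(log_price, 6)

# exp() recovers the original scale
exp(log_price)

# Caveat: a listing priced at 0 would map to -Inf and would need handling first
log(0)
```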
We first visualise whether there is any pattern in the relationship between price_per_night and mb_distance.
## `geom_smooth()` using formula 'y ~ x'
Next we visualise the relationship between price_per_night and the number of amenities provided.
## `geom_smooth()` using formula 'y ~ x'
To prepare for the model, we computed a correlation matrix to see how our variables of interest (within the model) are related.
airbnb_data_n %$%
cor(tibble(
log_price,
popular_status,
amenity_count,
mb_distance,
superhost_status
)) %>%
round(., 2)
## log_price popular_status amenity_count mb_distance
## log_price 1.00 -0.01 0.26 -0.21
## popular_status -0.01 1.00 0.05 0.02
## amenity_count 0.26 0.05 1.00 -0.15
## mb_distance -0.21 0.02 -0.15 1.00
## superhost_status 0.17 0.13 0.26 0.00
## superhost_status
## log_price 0.17
## popular_status 0.13
## amenity_count 0.26
## mb_distance 0.00
## superhost_status 1.00
airbnb_data_n %>%
select(log_price, popular_status, amenity_count, mb_distance, superhost_status) %>%
as.matrix(.) %>%
Hmisc::rcorr(.) %>%
tidy(.) %>% as_tibble()
## # A tibble: 10 x 5
## column1 column2 estimate n p.value
## <chr> <chr> <dbl> <int> <dbl>
## 1 log_price popular_status -0.00692 1000 8.27e- 1
## 2 log_price amenity_count 0.258 1000 0.
## 3 popular_status amenity_count 0.0534 1000 9.14e- 2
## 4 log_price mb_distance -0.209 1000 2.44e-11
## 5 popular_status mb_distance 0.0192 1000 5.45e- 1
## 6 amenity_count mb_distance -0.152 1000 1.37e- 6
## 7 log_price superhost_status 0.171 1000 5.66e- 8
## 8 popular_status superhost_status 0.134 1000 2.11e- 5
## 9 amenity_count superhost_status 0.257 1000 0.
## 10 mb_distance superhost_status 0.000441 1000 9.89e- 1
Then we performed mean-centering transformations on all the variables that will be turned into interaction terms.
airbnb_data2_n <- airbnb_data_n %>%
mutate_at(vars(amenity_count:mb_distance), list(~ . - mean(., na.rm = TRUE)))  # list(~ ...) replaces the deprecated funs()
glimpse(airbnb_data2_n)
## Rows: 1,000
## Columns: 15
## $ review_scores_rating <dbl> 63, 100, 96, 100, 90, 92, 96, 100, 100, 90, 98, …
## $ number_of_reviews <dbl> 6, 1, 146, 1, 20, 19, 19, 1, 5, 2, 24, 1, 33, 1,…
## $ price <chr> "$80.00", "$57.00", "$220.00", "$148.00", "$110.…
## $ host_is_superhost <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, F…
## $ amenities <chr> "{TV,Wifi,\"Air conditioning\",Kitchen,Elevator,…
## $ latitude <dbl> 1.30456, 1.31111, 1.31957, 1.30631, 1.30733, 1.2…
## $ longitude <dbl> 103.8347, 103.8582, 103.8465, 103.8481, 103.8471…
## $ superhost_status <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, …
## $ popular_status <dbl> 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, …
## $ amenity_count <dbl> -14.474, 4.526, 15.526, 11.526, -11.474, -21.474…
## $ mb_distance <dbl> -1.53175155, -1.95222147, -0.80274885, -2.246287…
## $ price_per_night <dbl> 80, 57, 220, 148, 110, 25, 80, 33, 188, 238, 30,…
## $ normalised_rating <dbl> 3.15, 5.00, 4.80, 5.00, 4.50, 4.60, 4.80, 5.00, …
## $ bad_score <dbl> 1.0799321, 0.0000000, 0.8177654, 0.0000000, 0.91…
## $ log_price <dbl> 4.382027, 4.043051, 5.393628, 4.997212, 4.700480…
We ran two regression models. The first (model3) regressed the logarithm of price per night (log_price) on the listing factors (superhost_status, popular_status), the number of amenities provided by the host (amenity_count) and the distance to Marina Bay (mb_distance). Our key investigation lies in the second (model4), which additionally includes the interactions of popular_status with amenity_count and with mb_distance; superhost_status enters as a main effect only.
model3 <-
lm(data = airbnb_data2_n,
log_price ~ superhost_status + popular_status + amenity_count + mb_distance)
tidy(model3) %>% as_tibble()
## # A tibble: 5 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.66 0.0356 131. 0.
## 2 superhost_status 0.210 0.0531 3.95 8.31e- 5
## 3 popular_status -0.0456 0.0448 -1.02 3.09e- 1
## 4 amenity_count 0.0163 0.00255 6.40 2.32e-10
## 5 mb_distance -0.0323 0.00549 -5.88 5.68e- 9
glance(model3)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.110 0.107 0.688 30.8 3.28e-24 5 -1042. 2096. 2126.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
model4 <-
lm(data = airbnb_data2_n,
log_price ~ superhost_status + popular_status * (amenity_count + mb_distance))
tidy(model4) %>% as_tibble()
## # A tibble: 7 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.65 0.0355 131. 0
## 2 superhost_status 0.219 0.0529 4.13 0.0000391
## 3 popular_status -0.0474 0.0446 -1.06 0.288
## 4 amenity_count 0.0140 0.00382 3.66 0.000261
## 5 mb_distance -0.0122 0.00912 -1.33 0.183
## 6 popular_status:amenity_count 0.00496 0.00498 0.997 0.319
## 7 popular_status:mb_distance -0.0321 0.0114 -2.82 0.00496
glance(model4)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.119 0.114 0.685 22.4 7.33e-25 7 -1037. 2090. 2129.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
Next we use the anova function to test whether Model 4, with its interaction terms, improves explanatory power over Model 3.
anova(model3, model4)
## Analysis of Variance Table
##
## Model 1: log_price ~ superhost_status + popular_status + amenity_count +
## mb_distance
## Model 2: log_price ~ superhost_status + popular_status * (amenity_count +
## mb_distance)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 995 470.59
## 2 993 465.78 2 4.8061 5.1231 0.006116 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result indicates that adding the interaction terms significantly improves the explanatory power of the model (F(2, 993) = 5.12, p = 0.006).
We check the linear model assumptions for model 4 using the ‘gvlma’ package.
gvlma(model4)
##
## Call:
## lm(formula = log_price ~ superhost_status + popular_status *
## (amenity_count + mb_distance), data = airbnb_data2_n)
##
## Coefficients:
## (Intercept) superhost_status
## 4.654959 0.218672
## popular_status amenity_count
## -0.047439 0.013998
## mb_distance popular_status:amenity_count
## -0.012160 0.004962
## popular_status:mb_distance
## -0.032119
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = model4)
##
## Value p-value Decision
## Global Stat 6.589906 0.15921 Assumptions acceptable.
## Skewness 3.157697 0.07557 Assumptions acceptable.
## Kurtosis 1.053528 0.30470 Assumptions acceptable.
## Link Function 2.371216 0.12359 Assumptions acceptable.
## Heteroscedasticity 0.007465 0.93115 Assumptions acceptable.
Next we check for multicollinearity in model 4 using the ‘car’ package, to ensure there is no problematic collinearity among the predictor variables.
vif(model4)
## superhost_status popular_status
## 1.092442 1.019871
## amenity_count mb_distance
## 2.482130 2.857631
## popular_status:amenity_count popular_status:mb_distance
## 2.341797 2.814296
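For readers unfamiliar with the statistic, the variance inflation factor of predictor j is defined from the R-squared obtained when regressing that predictor on all the others. The worked value below uses the largest VIF in the output above; the cut-off mentioned afterwards is a common rule of thumb, not something stated in the original analysis.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
\qquad\Longrightarrow\qquad
R_{\text{mb\_distance}}^2 = 1 - \frac{1}{2.857631} \approx 0.65
```

All values here are below 5, a threshold often used to flag problematic collinearity.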
We further investigate the significance of the coefficients of Model 4.
summary(model4)
##
## Call:
## lm(formula = log_price ~ superhost_status + popular_status *
## (amenity_count + mb_distance), data = airbnb_data2_n)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.84001 -0.43216 0.00149 0.44042 2.61769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.654959 0.035499 131.129 < 2e-16 ***
## superhost_status 0.218672 0.052928 4.131 3.91e-05 ***
## popular_status -0.047439 0.044591 -1.064 0.287652
## amenity_count 0.013998 0.003820 3.665 0.000261 ***
## mb_distance -0.012160 0.009122 -1.333 0.182811
## popular_status:amenity_count 0.004962 0.004978 0.997 0.319041
## popular_status:mb_distance -0.032119 0.011406 -2.816 0.004958 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6849 on 993 degrees of freedom
## Multiple R-squared: 0.1193, Adjusted R-squared: 0.114
## F-statistic: 22.42 on 6 and 993 DF, p-value: < 2.2e-16
There is a significant effect of distance to Marina Bay (mb_distance) on the logarithm of price per night (log_price) that depends on the popular status of the listing (popular_status).
kable(tidy(model4))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 4.6549585 | 0.0354991 | 131.1289900 | 0.0000000 |
| superhost_status | 0.2186724 | 0.0529282 | 4.1314879 | 0.0000391 |
| popular_status | -0.0474388 | 0.0445914 | -1.0638569 | 0.2876522 |
| amenity_count | 0.0139975 | 0.0038198 | 3.6645049 | 0.0002609 |
| mb_distance | -0.0121600 | 0.0091217 | -1.3330800 | 0.1828113 |
| popular_status:amenity_count | 0.0049624 | 0.0049776 | 0.9969309 | 0.3190409 |
| popular_status:mb_distance | -0.0321194 | 0.0114059 | -2.8160276 | 0.0049584 |
kable(glance(model4))
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.1193063 | 0.1139849 | 0.684885 | 22.42004 | 0 | 7 | -1036.922 | 2089.844 | 2129.106 | 465.784 | 993 |
The regression analysis produced one significant interaction term.
It appears that the relationship between the logarithm of price per night (log_price) and distance to Marina Bay (mb_distance) differs depending on the popularity of the listing (popular_status).
To see the pattern of this interaction, we visualize the significant interaction effect in the next section.
To visualize the OLS regression analysis performed above, we stored the OLS regression model’s predictions.
grid <- airbnb_data2_n %>%
data_grid(mb_distance, popular_status, amenity_count=0, superhost_status=0) %>%
add_predictions(model4)
We undid the centering of variable (mb_distance).
grid <- grid %>%
mutate(mb_distance = mb_distance + mean(airbnb_data_n$mb_distance))
The following figure shows two lines, one for each popular status, and how the relationship between distance to Marina Bay and the logarithm of price per night differs between them.
We also plotted the two popular statuses separately, with the data points shown as dots, so that we could see the pattern of the relationship from a different angle.
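The plotting code is not shown in this report. The sketch below is one way to draw the two predicted lines. It rebuilds the predictions from the coefficients reported by summary(model4) rather than from the `grid` object. It assumes that amenity_count and superhost_status are held at 0, as in the data_grid() call above. It also assumes that the mean distance (5.15) and the observed range [0.26, 19.75] can be taken from the sim_slopes() output later in this report. The object names `pred_grid` and `p` are illustrative.

```r
library(ggplot2)

# Coefficients copied from summary(model4) above (mean-centered fit).
b0     <- 4.654959    # (Intercept)
b_pop  <- -0.047439   # popular_status
b_dist <- -0.012160   # mb_distance (centered)
b_int  <- -0.032119   # popular_status:mb_distance

# Assumption: mean and observed range of mb_distance, taken from the
# sim_slopes() output reported later in this report.
mean_dist <- 5.15

pred_grid <- expand.grid(
  mb_distance    = seq(0.26, 19.75, length.out = 50),
  popular_status = c(0, 1)
)

# Predicted log price with amenity_count = 0 and superhost_status = 0,
# so the terms involving those variables drop out.
pred_grid$pred <- with(
  pred_grid,
  b0 + b_pop * popular_status +
    (b_dist + b_int * popular_status) * (mb_distance - mean_dist)
)

# One line per popular status, on the original (uncentered) distance scale.
p <- ggplot(pred_grid,
            aes(x = mb_distance, y = pred,
                colour = factor(popular_status))) +
  geom_line() +
  labs(x = "Distance to Marina Bay",
       y = "Predicted log price per night",
       colour = "popular_status")
print(p)
```

Because the predictions come from rounded coefficients, the lines may differ slightly from those produced by add_predictions(model4).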
The result has one very interesting finding:
The relationship between the log price per night and distance to Marina Bay depends on the popular status of the listing (popular_status). For listings further from Marina Bay, popular listings have a lower price per night than non-popular listings. For listings nearer to Marina Bay, there is not much difference in price per night.
Interaction Model without mean-centering
Plot interaction using interact_plot() to account for confidence intervals (CIs)
# Interaction Model
model4 <-
lm(data = airbnb_data_n,
log_price ~ superhost_status + popular_status * (amenity_count + mb_distance))
Run simple slopes analysis and spotlight analysis using the Johnson-Neyman technique
sim_slopes(model4,
pred = mb_distance,
modx = popular_status,
johnson_neyman = F)
## SIMPLE SLOPES ANALYSIS
##
## Slope of mb_distance when popular_status = 0.00 (0):
##
## Est. S.E. t val. p
## ------- ------ -------- ------
## -0.01 0.01 -1.33 0.18
##
## Slope of mb_distance when popular_status = 1.00 (1):
##
## Est. S.E. t val. p
## ------- ------ -------- ------
## -0.04 0.01 -6.47 0.00
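The two simple slopes above follow directly from the coefficient table. The worked equations below use the estimates reported by summary(model4); the percentage conversion in the last line is an added illustration of how to read a slope on the log scale, and is approximate.

```latex
\frac{\partial\, \widehat{\text{log\_price}}}{\partial\, \text{mb\_distance}}
  = \hat\beta_{\text{mb\_distance}}
  + \hat\beta_{\text{popular\_status:mb\_distance}} \times \text{popular\_status}

\text{popular\_status} = 0:\quad -0.012160 \approx -0.01

\text{popular\_status} = 1:\quad -0.012160 + (-0.032119) = -0.044279 \approx -0.04

e^{-0.044279} - 1 \approx -0.043
```

In other words, for popular listings each additional unit of mb_distance is associated with a price per night that is roughly 4.3% lower, holding the other predictors fixed.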
sim_slopes(model4,
pred = popular_status,
modx = mb_distance,
johnson_neyman = T)
## JOHNSON-NEYMAN INTERVAL
##
## When mb_distance is OUTSIDE the interval [-1.88, 6.60], the slope of
## popular_status is p < .05.
##
## Note: The range of observed values of mb_distance is [0.26, 19.75]
##
## SIMPLE SLOPES ANALYSIS
##
## Slope of popular_status when mb_distance = 1.14 (- 1 SD):
##
## Est. S.E. t val. p
## ------ ------ -------- ------
## 0.08 0.06 1.29 0.20
##
## Slope of popular_status when mb_distance = 5.15 (Mean):
##
## Est. S.E. t val. p
## ------- ------ -------- ------
## -0.05 0.04 -1.06 0.29
##
## Slope of popular_status when mb_distance = 9.17 (+ 1 SD):
##
## Est. S.E. t val. p
## ------- ------ -------- ------
## -0.18 0.06 -2.73 0.01
The result indicates that, within the observed range of mb_distance, the slope of popular_status is significant (p < .05) only for mb_distance > 6.60.
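To make the link between the coefficients and this output explicit, the sketch below recomputes the slope of popular_status at the three distances used by sim_slopes(). It uses the coefficients from the mean-centered fit reported earlier and assumes that the mean of mb_distance is 5.15, as shown in the output above. The function name `popular_effect` is illustrative.

```r
# Coefficients copied from summary(model4) (mean-centered fit).
b_pop <- -0.047439   # popular_status
b_int <- -0.032119   # popular_status:mb_distance

# Assumption: mean of mb_distance, as reported by sim_slopes().
mean_dist <- 5.15

# Estimated difference in log price (popular minus non-popular)
# at a given uncentered distance d.
popular_effect <- function(d) b_pop + b_int * (d - mean_dist)

effects <- round(popular_effect(c(1.14, 5.15, 9.17)), 2)
print(effects)   # 0.08 -0.05 -0.18, matching the Est. column above
```

The estimated difference changes sign close to Marina Bay and grows more negative with distance, which agrees with the Johnson-Neyman interval: the difference is only distinguishable from zero beyond 6.60.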
Run interact_plot() again, adding a benchmark for the regions of significance
However, this is an observational study, so the results indicate correlation, not causation; see the Limitations and Future Directions section below.
The Part 1 model suggests that listings with more amenities have better review scores ratings than listings with fewer amenities. However, for popular listings (those with more than 3 reviews), providing more amenities does not improve the review scores rating as much as it does for less popular listings. Guests who stay in more popular listings may pay less attention to the amenities provided than guests who stay in less popular listings. This may be because popular listings have more reviews, and guests may rate the listing based on other guests' reviews.
The Part 2 model suggests that the price per night of both popular and non-popular listings is higher closer to Marina Bay, although the slope for non-popular listings is not statistically significant (p = 0.18). Further away from Marina Bay, the price per night of popular listings falls much lower than that of non-popular listings. This might be a contributing factor in why popular listings attract more reviews, which may in turn reflect a higher occupancy rate.
Here are our limitations:
The study data was taken from http://insideairbnb.com/, a watchdog website launched by Murray Cox. We therefore depend on the accuracy of their data collection from the Airbnb Singapore site https://www.airbnb.com.sg/.
We also decided to use the dataset scraped in March 2020, before the Singapore Covid-19 Circuit Breaker was implemented in April 2020.
The property type (property_type) of the rooms/units is not included in the study, as there are 26 categories in the dataset of 7395 observations, and many have 10 or fewer observations. For instance: Aparthotel - 10, Boat - 7, Bus - 1, Cabin - 1, Campsite - 3, Chalet - 3, Earth house - 1, Heritage hotel (India) - 1, Igloo - 1, Tent - 6, Tiny house - 2.
Seasonal fluctuations in prices are not included in the modeling, for example the spike in prices at the end of the year or during holiday seasons.
Here are our future directions:
Further investigation should be conducted to determine whether the models can be applied to other cities, and to study the accuracy of the models.
The distance to Marina Bay is used as a variable in our models, since it is the iconic tourist venue in Singapore. Further investigation is needed to determine whether the number of shopping centers or tourist attractions around the rooms or units has any significant influence on the price or the review scores rating.
We are unable to determine the exact room/unit occupancy rate, as the information is not publicly available. This information is currently available on AirDNA at a premium of 17 USD per dataset. With the occupancy rate, we would be able to add another layer of insight to our models.
We have also noticed that many listings have a wide range of prices, some of them ridiculously high, exceeding even top luxury hotels such as The Capella Resort, Singapore in Sentosa. It is very difficult for a host to decide on an appropriate price.
We are also looking into writing a web scraping engine, so that we can scrape the latest data from Airbnb and check whether any of the listings are outdated or have been taken down, as we noticed some inconsistencies in the price data and some listings missing from the Inside Airbnb dataset.
We plan to build a platform that enables hosts to estimate the price range of well-reviewed rooms/units based on input parameters such as the city, popular status and amenities, using the latest Airbnb listing data scraped by our web scraping engine.